White wine Quality by Shalini Ruppa Subramanian

Introduction

I chose the white wine dataset for exploratory data analysis in R, given my affinity for chemistry. The primary question is which chemical properties affect the quality of white wine.

We have loaded the dataset and this is what the data looks like.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
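
The listing above is the kind of output that `str()`, `summary()` and `names()` produce. As a minimal sketch, assuming the data is read with `read.csv()`: the file below is a hypothetical three-row stand-in written to a temporary path (values copied from the first rows shown above), not the real dataset file. The name `wine_data` follows the data frame name that appears in later output.

```r
# Hypothetical stand-in for the real CSV file: a few columns, first three rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "X,fixed.acidity,volatile.acidity,alcohol,quality",
  "1,7,0.27,8.8,6",
  "2,6.3,0.3,9.5,6",
  "3,8.1,0.28,10.1,6"
), csv_path)

wine_data <- read.csv(csv_path)

str(wine_data)      # column types and first values
summary(wine_data)  # min, quartiles, mean and max per column
names(wine_data)    # column names
```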

Input variables (based on physicochemical tests):

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Quality is a discrete variable that ranges from 3 to 9, with a median of 6 and a mean of 5.878. Its distribution appears roughly normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Both fixed and volatile acidity have long positive tails, making the mean higher than the median. Fixed acidity ranges from 3.8 to 14.2 g/dm3 with a mean of 6.855 g/dm3 and a median of 6.8 g/dm3. Excluding the outliers of wines having a fixed acidity above 10 g/dm3, fixed acidity shows a normal distribution in the range of 5 to 10. Volatile acidity ranges from 0.08 to 1.10 g/dm3, with a mean of 0.278 g/dm3 and a median of 0.260 g/dm3. Excluding the outliers above 0.9 g/dm3, the volatile acidity distribution is slightly bimodal. In general, the volatile acidity is much lower than fixed acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid shows a distribution with a long positive tail. In the range of 0 to 0.8 g/dm3, the distribution appears to be normal. There are a few points above 0.8 g/dm3 that can be considered outliers. Some of the wines contain no citric acid at all.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual sugar distribution is skewed to the right, with a mean of 6.391 g/l above the median of 5.200 g/l. There are a lot of wines with a sugar level in the range of 1-2 g/l. There are a few outliers above 30 g/l.

The x axis was log transformed as the data was skewed to the right. It was interesting to observe a bimodal distribution with a group of sweeter white wines and less sweet white wines.
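
The effect of the log transformation can be sketched with base R. The data below is synthetic (a mixture of two log-normal groups chosen to mimic a drier and a sweeter cluster), not the wine data, and the names `residual_sugar`, `raw_hist` and `log_hist` are only for illustration.

```r
set.seed(1)

# Synthetic stand-in: a larger group of drier wines and a smaller sweeter group
residual_sugar <- c(
  rlnorm(300, meanlog = log(1.6), sdlog = 0.3),
  rlnorm(200, meanlog = log(9), sdlog = 0.4)
)

# Right skew shows up as a mean above the median
c(mean = mean(residual_sugar), median = median(residual_sugar))

# On the raw scale the long right tail dominates the plot
raw_hist <- hist(residual_sugar, breaks = 30,
                 main = "Raw scale", xlab = "residual sugar (g/l)")

# On the log10 scale the two groups appear as separate peaks
log_hist <- hist(log10(residual_sugar), breaks = 30,
                 main = "log10 scale", xlab = "log10(residual sugar)")
```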

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The distribution of chlorides is also positively skewed. After removing the top 1% of the values, it still shows a long tail distribution with the majority of the values from 0.01 to 0.10 g/dm^3. It has a mean of 0.04577 g/dm^3 and a median of 0.043 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The distributions of both total and free sulfur dioxide levels were positively skewed.

Removing the top 1% of the values, the total sulfur dioxide shows a normal distribution. It has a mean of 138.4 ppm and a median of 134 ppm. Removing the top 1% of the values, the free sulfur dioxide distribution has spikes up and down, although it has an overall bell shape curve. The free sulfur dioxide levels are observed to be lower than the total sulfur dioxide levels. It has a mean of 35.31 ppm and a median of 34 ppm.
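
Removing the top 1% of values, as described for chlorides and sulfur dioxide, can be sketched with `quantile()`. The helper `trim_top()` is a hypothetical name, and the input vector is made up for illustration. The report's actual plots may instead have limited the axis range.

```r
# Keep only values at or below the p-th quantile (p = 0.99 drops the top 1%)
trim_top <- function(x, p = 0.99) {
  x[x <= quantile(x, probs = p, na.rm = TRUE)]
}

# Made-up example: 99 ordinary values and one extreme outlier
values <- c(1:99, 1000)
trimmed <- trim_top(values)

mean(values)   # pulled upwards by the outlier
mean(trimmed)  # mean of the remaining 99 values
```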

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The density distribution is positively skewed. It has a mean of 0.994 and a median of 0.9937.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH ranges from 2.7 to 3.8, which is entirely on the acidic side, and follows an almost normal distribution. The mean and median are both about 3.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The alcohol content ranges from 8% to 14.2%. The distribution is irregular, with a peak at about 9.5%. The mean and median are 10.51% and 10.40% respectively. No wine in the dataset has an alcohol content below 8%.

The quality of the wines is converted to a factor variable and grouped into 3 categories: poor, neutral and good. Ratings of 3 and 4 are grouped as ‘poor’ quality, ratings of 5 and 6 as ‘neutral’ quality, and ratings of 7, 8 and 9 as ‘good’ quality.
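
One way to build such a grouping is with `cut()`. This is a sketch under assumptions rather than the exact code used for the report: the `quality` vector is rebuilt from the per-score counts reported in the quality table of this report, and the break points are one possible choice that yields the grouping described above.

```r
# Rebuild the quality scores from the per-score counts in this report
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are (2,4], (4,6] and (6,9], so 3-4 is poor, 5-6 neutral, 7-9 good
quality_group <- cut(quality,
                     breaks = c(2, 4, 6, 9),
                     labels = c("poor", "neutral", "good"))

table(quality_group)
```

The resulting counts (183 poor, 3655 neutral, 1060 good) match the quality group table shown in the bivariate section.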

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations and 13 columns. The X (id) and quality columns hold integer values and the remaining columns hold numeric values.

What is/are the main feature(s) of interest in your dataset?

The main area of interest is the wine quality rating given by the wine tasting experts. It will be interesting to see which chemicals in the white wine contribute to a high quality white wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features that may affect the taste and wine quality could be fixed acidity, citric acid, residual sugar and alcohol content. This will be explored further in bivariate and multivariate analysis.

Did you create any new variables from existing variables in the dataset?

I created a new factor variable that groups the quality of the wines into three categories: poor, neutral and good. With fewer categories, it is easier to compare how the other input variables change as quality increases.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I applied a log transformation to the x axis of the residual sugar histogram because the data was skewed to the right. After the log transformation, I observed a bimodal distribution with two peaks, at about 2 g/l and at about 9 g/l.

Bivariate Plots Section

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000

Of all the variables, the quality of the wine has the highest correlation with alcohol content. However, let’s also take a look at white wine quality against other factors, namely density, pH, citric acid and the sulfur dioxides.
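
To make the ranking explicit, the quality column of the correlation matrix above can be sorted by absolute value. The numbers below are copied (rounded to four decimals) from that matrix. With the full data frame, the same vector would presumably come from something like `cor(wine_data)[, "quality"]`, which is an assumption about how the matrix was produced.

```r
# Correlation of each input variable with quality, copied from the matrix above
quality_cor <- c(
  fixed.acidity        = -0.1137,
  volatile.acidity     = -0.1947,
  citric.acid          = -0.0092,
  residual.sugar       = -0.0976,
  chlorides            = -0.2099,
  free.sulfur.dioxide  =  0.0082,
  total.sulfur.dioxide = -0.1747,
  density              = -0.3071,
  pH                   =  0.0994,
  sulphates            =  0.0537,
  alcohol              =  0.4356
)

# Order by strength of the relationship, regardless of sign
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
ranked
```

Alcohol comes first, followed by density and chlorides.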

## 
##    poor neutral    good 
##     183    3655    1060
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

I converted the quality from an integer to a factor variable so that boxplots can be used, which makes it easier to see the median of the variable of interest. The median alcohol level appears to decrease as the quality increases from 3 to 5, and then to increase across the quality levels from 5 to 9. At every quality level there is a large variance in the alcohol content, except for quality level 9. At a quality of 9, the variance in the alcohol content is the lowest and the median alcohol content is the highest.

From the alcohol content vs quality_group plot, the median alcohol content in good quality wines is distinctly higher than in the ‘poor’ and ‘neutral’ white wines.
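
A boxplot per quality level and the matching group medians can be sketched in base R as follows. The data frame `wine_toy` and its values are made up for illustration. The report's own plots were drawn from the full dataset and may have used a different plotting library.

```r
# Made-up data: three wines at each of three quality levels
wine_toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.0, 10.5, 11.0, 11.0, 11.4, 12.1)
)

# Boxplots need a categorical grouping variable, hence the factor
wine_toy$quality_factor <- factor(wine_toy$quality)

# Median alcohol content within each quality level
alcohol_medians <- tapply(wine_toy$alcohol, wine_toy$quality_factor, median)
alcohol_medians

# One box per quality level; the thick line in each box is the median
boxplot(alcohol ~ quality_factor, data = wine_toy,
        xlab = "quality", ylab = "alcohol (% by volume)")
```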

## wine_data$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.588   4.600   6.392  10.700  16.200 
## -------------------------------------------------------- 
## wine_data$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## wine_data$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## wine_data$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## wine_data$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## wine_data$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## wine_data$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

This shows that the median residual sugar goes up and down across the quality levels, and there is also a large variance in residual sugar within each quality level. From quality 5 to 9 the median residual sugar broadly decreases, with a small rise at quality 8. It is also interesting to note that both the variance in residual sugar and the median (2.20 g/l) are lowest when the quality is the highest (9).

With the quality group plot vs residual sugar content, the median of the good wines is in between the ‘poor’ and ‘neutral’ wines. The median is 6.200.

The median density of the good wines is the lowest, at 0.9917 g/cm3.

Excluding the outliers, the median pH is between 3.1 and 3.3. I really doubt that wine tasting experts could detect pH differences as small as 0.1. For a quality of 9, the variance observed is very small. There are a lot of pH outliers at quality levels 5 to 7.

## wine_data$quality_group: poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.400   6.900   7.181   7.650  11.800 
## -------------------------------------------------------- 
## wine_data$quality_group: neutral
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.876   7.400  14.200 
## -------------------------------------------------------- 
## wine_data$quality_group: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.900   6.200   6.700   6.725   7.200   9.200

## wine_data$quality_group: poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.110   0.260   0.320   0.376   0.460   1.100 
## -------------------------------------------------------- 
## wine_data$quality_group: neutral
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2771  0.3200  0.9650 
## -------------------------------------------------------- 
## wine_data$quality_group: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2653  0.3200  0.7600

The median fixed acidity is nearly the same across the quality groups (6.9, 6.8 and 6.7 g/dm3 for poor, neutral and good wines respectively).

The median volatile acidity is the lowest in the good quality wines. This is consistent with the dataset description, which says that high levels of volatile acidity can lead to an unpleasant, vinegar taste.

## wine_data$quality_group: poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000 
## -------------------------------------------------------- 
## wine_data$quality_group: neutral
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03700 0.04400 0.04774 0.05100 0.34600 
## -------------------------------------------------------- 
## wine_data$quality_group: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03816 0.04400 0.13500

The good quality wines have the lowest median chloride amount, compared to the poor and neutral quality wines. The median chloride amounts are not distinctly apart from each other. A lot of outliers are noticed in the chloride amounts of good quality wines.

## wine_data$quality_group: poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    85.5   119.0   130.2   177.0   440.0 
## -------------------------------------------------------- 
## wine_data$quality_group: neutral
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   111.0   140.0   142.6   173.0   344.0 
## -------------------------------------------------------- 
## wine_data$quality_group: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    34.0   101.0   122.0   125.2   146.0   229.0
## wine_data$quality_group: poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   18.00   26.63   33.50  289.00 
## -------------------------------------------------------- 
## wine_data$quality_group: neutral
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.96   47.00  131.00 
## -------------------------------------------------------- 
## wine_data$quality_group: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   33.00   34.55   42.00  108.00
## wine_data$quality_group: poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.250   0.380   0.470   0.476   0.540   0.870 
## -------------------------------------------------------- 
## wine_data$quality_group: neutral
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4700  0.4876  0.5400  1.0600 
## -------------------------------------------------------- 
## wine_data$quality_group: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4000  0.4800  0.5001  0.5800  1.0800

The median total sulfur dioxide level in good wines is 122 mg/dm3, which is in between the poor and the neutral quality groups. However, the variance in the good quality wines is lower than in the poor and neutral quality wines. The white wines data sheet mentions that SO2 concentrations above 50 ppm become evident in the smell and taste of the wine. This is perhaps consistent with poor and neutral quality wines having many more outliers (above 200 ppm) than the good wines.

The median free sulfur dioxide is about the same in good and neutral wines, and it is higher than in the poor quality wines. The variance in the good quality wines is lower.

The median sulphates level is nearly the same across the quality groups (0.47, 0.47 and 0.48 for poor, neutral and good wines respectively).

## [1] 0.8389665

As the residual sugar increases, the density increases. Less variance in density is observed as the residual sugar increases. Perhaps there is also another factor influencing density. This is consistent with the data sheet, which says that alcohol content also affects density.

## [1] -0.7801376

As the alcohol content increases, the density decreases, with a strong negative correlation of -0.780.
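
The single numbers printed above look like the output of `cor()` on a pair of columns. The sketch below shows that call on synthetic data in which density is constructed to rise with sugar and fall with alcohol. The coefficients and noise level are arbitrary assumptions chosen only to show the signs of the correlations, not estimates from the wine data.

```r
set.seed(7)
n <- 200

# Synthetic inputs over roughly the ranges seen in the dataset
alcohol <- runif(n, min = 8, max = 14)
residual_sugar <- runif(n, min = 1, max = 20)

# Assumed toy relationship: sugar raises density, alcohol lowers it
density <- 0.998 - 0.0012 * (alcohol - 10) + 0.0004 * residual_sugar +
  rnorm(n, mean = 0, sd = 0.0003)

cor_sugar_density <- cor(residual_sugar, density)    # positive by construction
cor_alcohol_density <- cor(alcohol, density)         # negative by construction

c(sugar = cor_sugar_density, alcohol = cor_alcohol_density)
```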

## [1] 0.615501

As total sulfur dioxide increases, the free sulfur dioxide increases as well. The total sulfur dioxide is the amount of free and bound form of SO2 in the wine. It is not surprising that it has a high correlation. As total sulfur dioxide increases, the variance in free sulfur dioxide increases as well.

## [1] 0.5298813

The density increases as the total sulfur dioxide increases. As noted earlier, an increase in density is related to a decrease in wine quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The feature of interest in this case was the quality of white wine. Of all the variables, quality has the highest positive correlation with alcohol content (0.436). Quality also has a clear negative correlation with density (-0.307).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It was interesting that the residual sugar and density had a high positive correlation of 0.84 while the alcohol and density had a negative correlation of -0.78. Given that the fermentation process produces alcohol from the sugars, the more alcohol is produced, the less residual sugars are present.

The free sulfur dioxide and total sulfur dioxide also have a high positive correlation of 0.616. This was expected, as free sulfur dioxide is a part of the total sulfur dioxide. A general intuitive relationship between acidity and pH is that lower pH values go with increasing acidity.

What was the strongest relationship you found?

The strongest relationship was between the residual sugar and density with a correlation of 0.839.

Multivariate Plots Section

Taking the residual sugar as constant, the good quality wines have a lower density. It is also noticed that the good quality wines are more on the left side of the residual.sugar vs density plot. This is due to the increased alcohol content with the lower residual sugar.
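
A scatter plot of this kind, with points coloured by quality group, can be sketched in base R. The values in `toy` are made up so that, at each sugar level, the good wines have the lowest density. The colour names and the data frame are illustrative assumptions, not the report's plotting code.

```r
# Made-up data: three sugar levels for each quality group
toy <- data.frame(
  residual.sugar = rep(c(2, 5, 10), times = 3),
  density = c(0.9950, 0.9962, 0.9985,   # poor
              0.9935, 0.9950, 0.9972,   # neutral
              0.9915, 0.9930, 0.9955),  # good
  quality_group = factor(rep(c("poor", "neutral", "good"), each = 3),
                         levels = c("poor", "neutral", "good"))
)

# Map each group to a colour by name
group_colors <- c(poor = "firebrick", neutral = "grey50", good = "steelblue")
point_colors <- group_colors[as.character(toy$quality_group)]

plot(toy$residual.sugar, toy$density, col = point_colors, pch = 19,
     xlab = "residual sugar (g/l)", ylab = "density (g/cm^3)")
legend("topleft", legend = names(group_colors), col = group_colors, pch = 19)

# Density at a fixed sugar level of 5 g/l, by group
density_at_5 <- with(subset(toy, residual.sugar == 5),
                     tapply(density, quality_group, mean))
density_at_5
```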

Keeping total sulfur dioxide constant, higher quality wines are noticed with larger amounts of free sulfur dioxide and the poor quality wines are noticed with lower amounts of free sulfur dioxide.

Alcohol content from 12% to 14% is concentrated where the amount of chlorides is within 0.05 g/dm3. As fixed acidity increases, the pH decreases. Wines of all quality levels seem to share a similar trend, so this relationship doesn’t seem to influence the quality of wines greatly.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I thought the quality of white wines was influenced only by alcohol content when I started with the univariate plots. However, it was interesting to discover how density and residual sugar were connected with the alcohol content and how they all played a part in influencing the quality of the wines.

Were there any interesting or surprising interactions between features?

Chlorides and sulfur dioxides didn’t have much impact on quality when analysing the bivariate plots. However, they had an effect on the alcohol content, which in turn affected quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine_data)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wine_data)
## m3: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar, 
##     data = wine_data)
## m4: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
##     volatile.acidity, data = wine_data)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
##     volatile.acidity + chlorides, data = wine_data)
## 
## ================================================================
##                       m1      m2       m3       m4       m5     
## ----------------------------------------------------------------
##   (Intercept)       0.582   -24.492   88.313   72.225   71.271  
##                                                                 
##   I(alcohol)        0.313     0.360    0.246    0.286    0.283  
##                                                                 
##   density                    24.728  -87.886  -71.546  -70.514  
##                                                                 
##   residual.sugar                       0.053    0.052    0.052  
##                                                                 
##   volatile.acidity                             -2.059   -2.044  
##                                                                 
##   chlorides                                             -0.692  
##                                                                 
## ----------------------------------------------------------------
##   Log-likelihood     Inf     Inf      Inf      Inf      Inf     
##   Deviance             0.0     0.0      0.0      0.0      0.0   
##   AIC               -Inf    -Inf     -Inf     -Inf     -Inf     
##   BIC               -Inf    -Inf     -Inf     -Inf     -Inf     
##   N                 4898    4898     4898     4898     4898     
## ================================================================

I tried building a linear model of quality against alcohol. I added other terms that might have an effect on the quality, like density, residual.sugar, volatile acidity and chlorides.

A linear model was chosen because it is easy to start with. However, the r-squared value is only about 0.3, which means only about 30% of the variance in quality is explained by the independent variables. Another limitation is that the response variable, the quality of wines, is an ordered categorical score rather than a continuous measurement. It is also a subjective rating that will differ from person to person.
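
Building the model up one term at a time and reading off r-squared can be sketched as follows. The data frame `toy` is synthetic, with an assumed linear relationship plus noise, so the printed values say nothing about the wine data. The formulas also drop the `I()` wrappers seen in the calls above, since a plain column name is enough in a formula.

```r
set.seed(42)
n <- 200

# Synthetic predictors and a response built from an assumed linear relationship
toy <- data.frame(
  alcohol = runif(n, min = 8, max = 14),
  volatile.acidity = runif(n, min = 0.1, max = 0.6)
)
toy$quality <- 2 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, mean = 0, sd = 0.8)

# Start with one predictor, then add a second with update()
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)

# Share of variance in the response explained by each model
r_squared <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
r_squared
```

For nested least-squares models like these, r-squared cannot decrease when a term is added, so adjusted r-squared (`summary(m2)$adj.r.squared`) is often the fairer comparison.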


Final Plots and Summary

Alcohol and Wine Quality

Description One

This shows that the alcohol content for good wines is distinctly above that of the poor and neutral quality white wines. Poor and neutral quality wines have almost the same median value. One possible reason is that only about 4% of the data (183 of 4898 wines) was grouped as poor quality, while about 75% of the wines were grouped as neutral quality.
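
The group shares can be checked directly from the quality group counts reported in the bivariate section. This is a small arithmetic sketch and `group_counts` is just an illustrative name.

```r
# Counts per quality group, copied from the table in the bivariate section
group_counts <- c(poor = 183, neutral = 3655, good = 1060)

# Share of each group as a percentage of all 4898 wines
group_share <- round(100 * prop.table(group_counts), 1)
group_share
```

This gives about 3.7% poor, 74.6% neutral and 21.6% good wines.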

Plot Two

Description Two

Density has the strongest negative correlation with the quality of white wines. The density at wine quality 3 and 4 is about the same level as the density at a quality level of 6. From a quality rating of 5 onwards, there is a clear downward trend in density as the wine quality increases.

Plot Three

Description Three

The third plot shows the interaction of density and residual sugar and the distribution of the white wine quality ratings. It can be seen that the good quality wines have a lower density for the same amount of residual sugar. It also shows that grouping quality into fewer categories enabled us to see the results more clearly.


Reflection

Initially, I did not group the quality of white wines into buckets. When I came to bivariate analysis, I checked the median quality of wines across other input variables. It was difficult to make comparisons as there were seven levels of quality from 3 to 9 and no clear trends could be established. So I categorised the quality into buckets, reducing it from 7 levels to 3, and carried out the analysis again. With this change, it was easier to make comparisons across the quality levels.

When I started with univariate plots, going through each variable in the dataset was time-consuming, and I was struggling with how to do the bivariate plots for every possible combination of variables. The bivariate plots helped me to focus on the pairs of variables that had a very high positive or negative correlation and to ignore the pairs with almost zero correlation.

Once a variable affecting the quality of wines (in this case, alcohol content) was identified using the bivariate plots, I tracked the other variables affecting this variable. This led to exploring how these variables interact with one another to affect the quality of wines. This seems to have steered me in the right direction, and I could complete the rest of the analysis successfully.

A linear regression model attempt shows that only about 30% of the variance in quality is explained by all of the independent variables. This opens up more avenues to explore, such as the weight each independent variable carries in affecting the quality of wines. It will also be interesting to see whether this holds for the red wines data as well.
